A regional bank with branches in Yaroslavl and the regional cities of Rostov and Rybinsk wants to know how customers use its services.
The purpose of the study:
Research objectives:
Available information:
The research process:
# importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import phik
from phik import resources, report
from phik.report import plot_correlation_matrix
import plotly.express as px
from scipy import stats as st
# saving the dataset to the 'data' variable
try:
data = pd.read_csv('/datasets/bank_dataset_eng.csv', sep=',')
except:
data = pd.read_csv('bank_dataset_eng.csv', sep=',')
# the first five rows of the dataset
display(data.head())
# checking the number of rows in the dataset
print("Total rows in the dataset:", data.shape[0])
| userid | score | City | Gender | Age | Objects | Balance | Products | CreditCard | Loyalty | estimated_salary | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15677338 | 619 | Yaroslavl | F | 42 | 2 | NaN | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15690047 | 608 | Rybinsk | F | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15662040 | 502 | Yaroslavl | F | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15744090 | 699 | Yaroslavl | F | 39 | 1 | NaN | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15780624 | 850 | Rybinsk | F | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Total rows in the dataset: 10000
Column headers are in different styles, it is not always clear what data they describe.
# basic information
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 userid 10000 non-null int64 1 score 10000 non-null int64 2 City 10000 non-null object 3 Gender 10000 non-null object 4 Age 10000 non-null int64 5 Objects 10000 non-null int64 6 Balance 6383 non-null float64 7 Products 10000 non-null int64 8 CreditCard 10000 non-null int64 9 Loyalty 10000 non-null int64 10 estimated_salary 10000 non-null float64 11 Churn 10000 non-null int64 dtypes: float64(2), int64(8), object(2) memory usage: 937.6+ KB
The dataset contains numerical and categorical features, as well as categorical features that are encoded by numbers (for example, the presence of a credit card or accuracy). There are missing values in the Balance column.
# descriptive statistics
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| userid | 10000.0 | 1.573368e+07 | 71936.186123 | 15608437.00 | 1.567126e+07 | 1.573347e+07 | 1.579597e+07 | 15858426.00 |
| score | 10000.0 | 6.505288e+02 | 96.653299 | 350.00 | 5.840000e+02 | 6.520000e+02 | 7.180000e+02 | 850.00 |
| Age | 10000.0 | 3.892180e+01 | 10.487806 | 18.00 | 3.200000e+01 | 3.700000e+01 | 4.400000e+01 | 92.00 |
| Objects | 10000.0 | 5.012800e+00 | 2.892174 | 0.00 | 3.000000e+00 | 5.000000e+00 | 7.000000e+00 | 10.00 |
| Balance | 6383.0 | 1.198275e+05 | 30095.056462 | 3768.69 | 1.001820e+05 | 1.198397e+05 | 1.395123e+05 | 250898.09 |
| Products | 10000.0 | 1.530200e+00 | 0.581654 | 1.00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.00 |
| CreditCard | 10000.0 | 7.055000e-01 | 0.455840 | 0.00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.00 |
| Loyalty | 10000.0 | 5.151000e-01 | 0.499797 | 0.00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.00 |
| estimated_salary | 10000.0 | 1.000902e+05 | 57510.492818 | 11.58 | 5.100211e+04 | 1.001939e+05 | 1.493882e+05 | 199992.48 |
| Churn | 10000.0 | 2.037000e-01 | 0.402769 | 0.00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.00 |
The minimum salary of 11.58 rubles looks like an erroneous value.
There are 10,000 records in the dataset.
There are twelve columns in total:
userid — user ID;score — credit scoring points;City — city;Gender — gender;Age — age;Objects — number of objects owned;Balance — account balance;Products — the number of products used by the client;CreditCard — is there a credit card;Loyalty — active client;estimated_salary — client's salary;Churn — has the client left or not.Column names do not correspond to the accepted style. There are missing values in the Balance column. There are also categorical features that are encoded by numbers, this should be taken into account for further work with the data.
# renaming the dataframe columns, bringing them to the accepted style
data = data.rename(columns={'userid': 'user_id',
'score': 'credit_scoring',
'City': 'city',
'Gender': 'gender',
'Age': 'age',
'Objects': 'objects_owned',
'Balance': 'balance',
'Products': 'products',
'CreditCard': 'credit_card',
'Loyalty': 'loyalty',
'Churn': 'churn'})
There are missing values in the balance column. Perhaps these are churn clients. Let's check the hypothesis.
# we output the number of missing values in the 'balance' column among the churn and non-churn clients
print(data.query('balance.isna() and churn == 1', engine='python').shape[0])
print(data.query('balance.isna() and churn == 0', engine='python').shape[0])
500 3117
There are more missing values among non-churn clients. Let's see if there is a correlation between the balance and salary to fill in the missing values.
data['balance'].corr(data['estimated_salary'])
-0.0018564885821502858
The balance does not correlate with salary. Let's check if the missing values are related to specific branches of the bank.
# group the rows with the 'balance' missing value by city and count the number
data.query('balance != balance').groupby('city').agg(number=('city', 'count')).reset_index()
| city | number | |
|---|---|---|
| 0 | Rybinsk | 1199 |
| 1 | Yaroslavl | 2418 |
All the missing values in the balance column are in Rybinsk and Yaroslavl branches. The reason may be that the bank's branch in Rostov worked only with salary clients.
We are leaving the missing values as they are.
# checking for duplicates
data.duplicated().sum()
0
The estimated_salary column has an abnormally low minimum. Let's look at the box-and-whisker diagram.
data[['estimated_salary']].boxplot()
<AxesSubplot:>
There are no outliers. The 25th percentile is 50 000 rubles. Let's look at the 2nd and 1st percentiles.
print(np.percentile(data['estimated_salary'], [2, 1]))
[3769.4488 1842.8253]
One percent of clients from the dataset have a salary estimated by the bank to be less than or equal to 1,842 rubles. Perhaps this is due to the peculiarities of salary assessment or the fact that the client works part-time. We will delete records from the dataset with a salary of less than 1000 rubles: such a salary looks abnormal even for Russian small towns.
# checking what percentage of data will be deleted
print("Percentage of deleted data:", round(data.query('estimated_salary < 1000').shape[0] * 100 / data.query('estimated_salary >= 1000').shape[0], 2))
Percentage of deleted data: 0.59
data = data.query('estimated_salary >= 1000')
# checking the number of remaining records
data.shape[0]
9941
There are no duplicates in the data. The dataset was processed as follows:
balance column were left unprocessed, because, perhaps, they are related to the peculiarities of the work of the branch in Rostov.There are 9,941 records left in the dataset after processing.
We look at the distribution of the following features: credit scoring points, age, number of objects owned, balance, salary.
features = ['credit_scoring', 'age', 'objects_owned', 'balance', 'estimated_salary']
titles = ['credit scoring points', 'age', 'number of objects owned', 'account balance', 'salary']
plt.style.use('seaborn')
for i in range(len(features)):
plt.figure(figsize=(12,7))
sns.histplot(data = data[features[i]], bins=50)
plt.title('Distribution histogram: ' + titles[i],fontsize=15)
plt.ylabel('Number of clients',fontsize=12)
plt.xlabel(titles[i].capitalize(),fontsize=12)
plt.show()
Let's see how many products customers use.
# we group the data by the number of products and count the number of customers for each group
product_number = (data.groupby('products') \
.agg(clients=('products', 'count')) \
.sort_values(by='clients', ascending=False)) \
.reset_index()
# barplot with the number of clients depending on the number of products they use
plt.figure(figsize=(9,7))
plt.title('The number of bank clients by the number of products they use',fontsize=15)
g = sns.barplot(data=product_number, x='products', y='clients')
for p in g.patches:
g.annotate(round(p.get_height()), xy=(p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
plt.xlabel('Number of products',fontsize=12)
plt.ylabel('Number of clients',fontsize=12);
The vast majority of clients use one (5,053 clients) or two (4,562 clients) products.
Let's look at the distribution of clients by cities where there are bank branches.
# we group the data by city, count the number of clients
cities = (data.groupby('city') \
.agg(clients=('city', 'count')) \
.sort_values(by='clients', ascending=False) \
.reset_index())
cities
| city | clients | |
|---|---|---|
| 0 | Yaroslavl | 4980 |
| 1 | Rostov | 2492 |
| 2 | Rybinsk | 2469 |
# barplot with the name of the cities and the number of clients
plt.figure(figsize=(10,5))
plt.title('The number of clients in the cities where there are branches of the bank',fontsize=15)
g = sns.barplot(data=cities, x='clients', y='city', orient='h')
for p in g.patches:
width, height = p.get_width(), p.get_height()
x, y = p.get_xy()
g.text(x+width + 20,
y+height/2,
'{:.0f}'.format(width),
horizontalalignment='left',
verticalalignment='center')
plt.xlabel('Number of clients',fontsize=12)
plt.ylabel('City',fontsize=12);
Half of the bank's clients are from Yaroslavl, approximately 25% of clients are in Rostov and Rybinsk.
Let's look at the distribution of features such as gender, credit card availability, loyalty, churn (whether the client left or not).
# barplots for each of the features
features = ['gender', 'credit_card', 'loyalty', 'churn']
titles = ['gender', 'credit_card', 'loyalty', 'churn']
titles_x = ['Gender', 'There is a credit card', 'Loyal client', 'The client left']
for i in range(len(features)):
plt.figure(figsize=(6,6))
plt.title('Number of bank clients depending on ' + titles[i],fontsize=15)
df = (data.groupby(features[i]) \
.agg(clients=(features[i], 'count')) \
.sort_values(by='clients', ascending=False) \
.reset_index())
g = sns.barplot(data=df, x=features[i], y='clients')
for p in g.patches:
g.annotate(round(p.get_height()), xy=(p.get_x() + p.get_width() / 2, p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
plt.xlabel(titles_x[i],fontsize=12)
plt.ylabel('Number of clients',fontsize=12)
plt.show();
Among the bank's clients, there are almost a thousand more men than women (5427 and 4514, respectively). 70% of customers have a credit card. The number of loyal and disloyal clients is almost the same with a slight advantage in the direction of loyal. About 20% of the bank's customers had left at the time of the research.
Before looking at the correlation of features, we'll transform the categorical features of gender and city into a suitable form using fast coding.
We will also apply fast coding to the products column: since our goal is to segment the bank's customers by the number of products, we will check how the number of products correlates with other characteristics.
# changing the gender designation to 0 and 1
data.loc[data['gender'] == 'F', 'gender'] = 0
data.loc[data['gender'] == 'M', 'gender'] = 1
data['gender'] = data['gender'].astype('int')
# converting the name of the city using fast encoding, saving it to the 'data_new' variable
data_new = pd.get_dummies(data, columns=['city'])
# converting the number of products using fast coding
data_new = pd.get_dummies(data_new, columns=['products'])
# first rows of the new dataset
data_new.head()
| user_id | credit_scoring | gender | age | objects_owned | balance | credit_card | loyalty | estimated_salary | churn | city_Rostov | city_Rybinsk | city_Yaroslavl | products_1 | products_2 | products_3 | products_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15677338 | 619 | 0 | 42 | 2 | NaN | 1 | 1 | 101348.88 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 15690047 | 608 | 0 | 41 | 1 | 83807.86 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 15662040 | 502 | 0 | 42 | 8 | 159660.80 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 15744090 | 699 | 0 | 39 | 1 | NaN | 0 | 0 | 93826.63 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 15780624 | 850 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 79084.10 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
# we leave only those records where the balance is not a missing value
data_new = data_new.query('balance == balance')
Let's build on the correlation matrix of features.
data_new = data_new.drop(columns=['user_id'])
plt.figure(figsize=(15,15))
sns.heatmap(data_new.corr(), linewidth = 1, annot=True);
The heat map shows that the features are almost uncorrelated. Perhaps the features have non-linear connections. Let's check the correlation between the signs using the Phik correlation coefficient and build a heat map.
plt.figure(figsize=(15,15))
sns.heatmap(data_new.phik_matrix(), linewidth = 1, annot=True);
interval columns not set, guessing: ['credit_scoring', 'gender', 'age', 'objects_owned', 'balance', 'credit_card', 'loyalty', 'estimated_salary', 'churn', 'city_Rostov', 'city_Rybinsk', 'city_Yaroslavl', 'products_1', 'products_2', 'products_3', 'products_4']
Let's look at the churn. There is a moderate positive correlation with age (the highest value among all features), as well as with the "3 products" feature.
In general, Rostov differs from Yaroslavl and Rybinsk on the basis of "churn" feature.
Let's build a scatter plot matrix based on the main quantitative features and add a color designation depending on the value of the churn column.
df = data_new[['credit_scoring', 'age', 'balance', 'estimated_salary','churn']]
sns.pairplot(df, hue='churn');
We can say that the clients who left have the same distribution of credit scoring, balance and salary as those who remained, but the median age of those who left is higher: 45 years against 36 for the remaining clients. This may be due to the fact that the churn clients used only one product (for example, they took out a loan and paid it off), when the need for it disappeared, such customers left.
It is also noticeable that a significant proportion of the exact customers are aged from 40 to 65 years. After 65, their share decreases.
"Churn" feature is directly correlated with age, as well as with "three products". There is a weak direct correlation with the "four products" feature and the city of Rostov, as well as a weak inverse correlation with loyalty, the "two products" feature, the city of Yaroslavl and gender.
For clients who left, the distribution of credit scoring, balance and salary is the same as for the remaining ones, but the median age of those who left is higher: 45 years versus 36 for the remaining clients. This may be due to the fact that the churn clients used only one product (for example, they took out a loan and paid it off), when the need for it disappeared, such clients left.
A significant proportion of the churn clients are aged from 40 to 65 years. After 65, their share decreases.
We divide users into segments by the number of products, by age, and also by bank branch.
From the data analysis, we know that most clients use one (5,503) or two (4,562) bank products. A very small proportion of clients use three (266) or four (60) products.
Also, according to the correlation matrix of features, we can say that the features "three products" and "four products" are similar to each other and differ from the other signs (one and two products).
Therefore, we will divide customers into three segments according to the number of products they use:
# adding a new 'product_segment' column to the dataset
data['product_segment'] = data['products'].apply(lambda x: str(x) if x == 1 or x == 2 else '3+')
# checking the updated dataset
data.head()
| user_id | credit_scoring | city | gender | age | objects_owned | balance | products | credit_card | loyalty | estimated_salary | churn | product_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15677338 | 619 | Yaroslavl | 0 | 42 | 2 | NaN | 1 | 1 | 1 | 101348.88 | 1 | 1 |
| 1 | 15690047 | 608 | Rybinsk | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 |
| 2 | 15662040 | 502 | Yaroslavl | 0 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 3+ |
| 3 | 15744090 | 699 | Yaroslavl | 0 | 39 | 1 | NaN | 2 | 0 | 0 | 93826.63 | 0 | 2 |
| 4 | 15780624 | 850 | Rybinsk | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 1 |
Let's see how, in general, the bank's customers are divided into segments depending on the number of products.
# we group the data by segment, count the number of clients in each and build a pie chart
fig = px.pie(data.groupby('product_segment').agg(number=('age', 'count')).reset_index(),
values='number',
names='product_segment',
title='Segment shares by number of products',
width=400, height=400)
fig.show()
Let's look at the charts of the scope of credit scoring, age, number of objects owned, balance and salary in the context of segments by the number of products.
features = ['credit_scoring', 'age', 'balance', 'objects_owned', 'estimated_salary']
titles = ['credit scoring', 'age', 'balance', 'objects owned', 'estimated salary']
for i in range(len(features)):
plt.figure(figsize=(9,6))
plt.title('Boxplot: ' + titles[i],fontsize=15)
g = sns.boxplot(data=data.sort_values(by='product_segment'),
x='product_segment',
y=data[features[i]])
plt.xlabel('Segments by number of products',fontsize=12)
plt.ylabel(titles[i].capitalize(),fontsize=12);
On the boxplots, we see that the segments practically do not differ in such features as credit scoring, account balance and salary.
There are differences in age: the "youngest" segment — customers with two products, the most "adult" — with three and four . Median age of clients:
Also, with the same median number of objects owned by clients of all segments,
Let's see how these three segments are distributed across cities.
# we group the dataset by cities and segments, count the number and percentage of customers in each segment,
# and save it to the 'cities' variable
cities = data.groupby(['city', 'product_segment']).agg(clients = ('product_segment', 'count')).reset_index()
cities['percent'] = round(cities['clients'] / cities.groupby('city')['clients']. transform('sum'), 2) * 100
# plotting a graph with the share of each segment in cities
fig = px.bar(cities,
x="percent",
y="city",
color="product_segment",
orientation='h', width=1000, height=400,
color_discrete_sequence=px.colors.qualitative.T10,
text = cities['percent'].map("{:,}%".format),
labels={"percent": "Percentage of customers",
"city": "City",
"product_segment": "Products"},
title="Percentage of customers of each product segment by city")\
.update_yaxes(categoryorder="min ascending")
fig.show()
In Yaroslavl and Rybinsk, the share of segments is almost the same, in Rostov there are slightly more segments with one product and with 3+, but less with two.
Let's look at the percentage of customers in each segment according to the following features: credit card availability, loyalty, churn.
features = ['credit_card', 'loyalty', 'churn']
titles_legend = ['Has a credit card', 'Loyal client', 'Сlient left']
titles = ['The share of customers with and without a credit card by product segments',
'Share of loyal and disloyal customers by product segments',
'The share of churned and non-churned customers by product segments']
for i in range(len(features)):
df = data.groupby(['product_segment', features[i]]).agg(clients = ('product_segment', 'count')).reset_index()
df['percent'] = round(df['clients'] / df.groupby('product_segment')['clients']. transform('sum') * 100)
df[features[i]] = df[features[i]].astype('str')
display(df.head())
fig = px.bar(df,
x="percent",
y="product_segment",
color=features[i],
orientation='h', width=1000, height=400,
color_discrete_sequence=px.colors.qualitative.T10,
text = df['percent'].map("{:,}%".format),
labels={"percent": "Percentage of customers",
"product_segment": "Number of products",
features[i]: titles_legend[i]},
title=titles[i])
#.update_yaxes(categoryorder="max ascending")
fig.show()
| product_segment | credit_card | clients | percent | |
|---|---|---|---|---|
| 0 | 1 | 0 | 1499 | 30.0 |
| 1 | 1 | 1 | 3554 | 70.0 |
| 2 | 2 | 0 | 1333 | 29.0 |
| 3 | 2 | 1 | 3229 | 71.0 |
| 4 | 3+ | 0 | 95 | 29.0 |
| product_segment | loyalty | clients | percent | |
|---|---|---|---|---|
| 0 | 1 | 0 | 2508 | 50.0 |
| 1 | 1 | 1 | 2545 | 50.0 |
| 2 | 2 | 0 | 2134 | 47.0 |
| 3 | 2 | 1 | 2428 | 53.0 |
| 4 | 3+ | 0 | 184 | 56.0 |
| product_segment | churn | clients | percent | |
|---|---|---|---|---|
| 0 | 1 | 0 | 3654 | 72.0 |
| 1 | 1 | 1 | 1399 | 28.0 |
| 2 | 2 | 0 | 4217 | 92.0 |
| 3 | 2 | 1 | 345 | 8.0 |
| 4 | 3+ | 0 | 46 | 14.0 |
The shares of clients with and without a credit card are almost equal in each segment: 70% of customers have a credit card.
Loyal customers are more among those who use two products — 53%, less among customers with 3+ products — 44%.
The highest share of churned customers in the segment of those who use three and four products is 86%. The lowest among those with two products is 8%.
We have divided customers into three segments depending on the number of products they use. The segments practically do not differ in such features as credit scoring, balance and salary, 70% of customers in each segment have a credit card. Let's describe the other features.
From the EDA, we know that a significant proportion of churned customers are aged 40 to 65 years. After 65, their share decreases.
Therefore, we will divide customers into three segments by age:
# function for categorizing clients by age
def categorize_age(age):
"""The function returns the age segment depending on the age value"""
try:
if 18 <= age < 40:
return '18-40'
elif 40 <= age < 65:
return '40-65'
elif 65 <= age:
return '65+'
except:
pass
# adding a new 'age_segment' column to the dataset using the 'categorize_age' function
data['age_segment'] = data['age'].apply(categorize_age)
# checking the function
data.head()
| user_id | credit_scoring | city | gender | age | objects_owned | balance | products | credit_card | loyalty | estimated_salary | churn | product_segment | age_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15677338 | 619 | Yaroslavl | 0 | 42 | 2 | NaN | 1 | 1 | 1 | 101348.88 | 1 | 1 | 40-65 |
| 1 | 15690047 | 608 | Rybinsk | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 40-65 |
| 2 | 15662040 | 502 | Yaroslavl | 0 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 3+ | 40-65 |
| 3 | 15744090 | 699 | Yaroslavl | 0 | 39 | 1 | NaN | 2 | 0 | 0 | 93826.63 | 0 | 2 | 18-40 |
| 4 | 15780624 | 850 | Rybinsk | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 1 | 40-65 |
Let's see how, in general, the bank's customers were divided into segments depending on age.
# we group the data by age segment, count the number of customers in each and build a pie chart
fig = px.pie(data.groupby('age_segment').agg(number=('age', 'count')).reset_index(),
values='number',
names='age_segment',
title='Segment shares by age',
width=400, height=400)
fig.show()
The largest segment is of 18-40-year-old customers (almost 60%), 37% in the 40-65-year-old segment, the smallest segment is of 65+ years — about 3%.
Let's look at the boxplots of credit scoring, the number of objects owned, balance and salary by age segments.
features = ['credit_scoring', 'balance', 'objects_owned', 'estimated_salary']
titles = ['credit scoring', 'balance', 'objects owned', 'estimated salary']
for i in range(len(features)):
plt.figure(figsize=(9,6))
plt.title('Boxplot: ' + titles[i],fontsize=15)
g = sns.boxplot(data=data.sort_values(by='age_segment'),
x='age_segment',
y=data[features[i]])
plt.xlabel('Age segments',fontsize=12)
plt.ylabel(titles[i].capitalize(),fontsize=12);
On the boxplots, we see that the segments practically do not differ in salaries (we will consider this a feature of the training dataset).
The credit scoring is higher for the "65+" segment.
With the same median balance in all three segments, "65+" has a smaller scope — from 50 to 187 thousand rubles. Segments "18-40" and "40-65" have outliers both up and down.
Also, with the same median number of objects owned by customers of all segments,
Let's look how the age segments are distributed in each of the cities of the bank's presence.
# we group the dataset by cities and segments, count the number and percentage of customers in each segment,
# save it to the 'cities_age' variable
cities_age = data.groupby(['city', 'age_segment']).agg(clients = ('age_segment', 'count')).reset_index()
cities_age['percent'] = round(cities_age['clients'] / cities_age.groupby('city')['clients']. transform('sum')* 100, 1)
# graph with the share of each segment in cities
fig = px.bar(cities_age,
x="percent",
y="city",
color="age_segment",
orientation='h', width=1000, height=400,
color_discrete_sequence=px.colors.qualitative.T10,
text = cities_age['percent'].map("{:,}%".format),
labels={"percent": "Percentage of customers",
"city": "City",
"age_segment": "Age"},
title="Percentage of clients of each age segment by city")\
.update_yaxes(categoryorder="min ascending")
fig.show()
Rostov differs from Yaroslavl and Rybinsk in the distribution of age segments: in this city, the "18-40" segment is smaller than in the other two cities, and is 55% vs. 61% and 62%, and the "40-65" segment is larger — 43% vs. 37% and 35%.
Let's see how many products are used by customers in each age segment.
# we group the dataset by product and age segments, count the number and percentage of clients in each segment,
# save it to the 'age_products' variable
age_products = data.groupby(['age_segment', 'product_segment']).agg(clients = ('age', 'count')).reset_index()
age_products['percent'] = round(age_products['clients'] / age_products.groupby('age_segment')['clients']. transform('sum') * 100, 1)
# graph with the share of each segment in cities
fig = px.bar(age_products,
x="percent",
y="age_segment",
color="product_segment",
orientation='h', width=1000, height=400,
color_discrete_sequence=px.colors.qualitative.T10,
text = age_products['percent'].map("{:,}%".format),
labels={"percent": "Percentage of clients",
"age_segment": "Age",
"product_segment": "Number of products"},
title="Percentage of clients of each age segment by number of products")
fig.show()
Let's look at the percentage of customers in each segment according to the following features: credit card availability, loyalty, churn.
features = ['credit_card', 'loyalty', 'churn']
titles_legend = ['Has a credit card', 'Loyal customer', 'Churned client']
titles = ['Percentage of customers with and without a credit card by age segment',
'The share of loyal and disloyal customers by age segments',
'The share of churned and non-churned customers by age segments']
for i in range(len(features)):
df = data.groupby(['age_segment', features[i]]).agg(clients = ('age_segment', 'count')).reset_index()
df['percent'] = round(df['clients'] / df.groupby('age_segment')['clients']. transform('sum') * 100)
df[features[i]] = df[features[i]].astype('str')
display(df.head())
fig = px.bar(df,
x="percent",
y="age_segment",
color=features[i],
orientation='h', width=1000, height=400,
color_discrete_sequence=px.colors.qualitative.T10,
text = df['percent'].map("{:,}%".format),
labels={"percent": "Percentage of clients",
"age_segment": "Age",
features[i]: titles_legend[i]},
title=titles[i])
fig.show()
| age_segment | credit_card | clients | percent | |
|---|---|---|---|---|
| 0 | 18-40 | 0 | 1713 | 29.0 |
| 1 | 18-40 | 1 | 4239 | 71.0 |
| 2 | 40-65 | 0 | 1133 | 31.0 |
| 3 | 40-65 | 1 | 2576 | 69.0 |
| 4 | 65+ | 0 | 81 | 29.0 |
| age_segment | loyalty | clients | percent | |
|---|---|---|---|---|
| 0 | 18-40 | 0 | 2957 | 50.0 |
| 1 | 18-40 | 1 | 2995 | 50.0 |
| 2 | 40-65 | 0 | 1831 | 49.0 |
| 3 | 40-65 | 1 | 1878 | 51.0 |
| 4 | 65+ | 0 | 38 | 14.0 |
| age_segment | churn | clients | percent | |
|---|---|---|---|---|
| 0 | 18-40 | 0 | 5359 | 90.0 |
| 1 | 18-40 | 1 | 593 | 10.0 |
| 2 | 40-65 | 0 | 2321 | 63.0 |
| 3 | 40-65 | 1 | 1388 | 37.0 |
| 4 | 65+ | 0 | 237 | 85.0 |
Based on the results of EDA, we divided the clients into three age segments. The segments practically do not differ in salary (a feature of the dataset), about 70% of customers in each segment have a credit card. Let's describe the other features.
18-40 years old. This segment is the largest (59.9%). Half of the clients have from 3 to 7 objects owned, 25% have 7 or more. Half of the customers use two products, 48% one, only 2% three and four. There are the equal number of loyal and disloyal customers, while proportion of churned customers is the lowest (10)%.
40-65 years old. This segment accounted for 37.3% of the total number of customers. Half of the clients have from 2 to 8 objects owned, 25% have 8 or more. 56% of customers use one product, 39% two, 5.5% three and four. Loyal and disloyal customers are approximately equal, the highest proportion of churned customers (37%).
65+ years old. The smallest segment — 2.82% of all the bank's customers. Customers in this segment are more reliable: the median credit score is 660 points. Half of the clients have from 3 to 8 objects owned, 25% have 8 or more. With the same median balance in all three segments, "65+" has a smaller scope — from 50 to 187 thousand rubles. 52% of customers have one product, 45% have two, 3.6% have 3 and 4. This segment has the highest share of loyal customers — 86%, churned customers — 15%.
At the stage of EDA, as well as user segmentation by product and age, we paid attention to the following:
in general, Rostov differs from Yaroslavl and Rybinsk on the basis of "churn" feature.
in Yaroslavl and Rybinsk, the share of segments in terms of the number of products is almost the same, in Rostov there are slightly more segments with one product and with 3+, but less with two.
Rostov differs from Yaroslavl and Rybinsk in the distribution of age segments: in this city, the "18-40" segment is smaller than in the other two cities, and is 55% vs. 61% and 62%, and the "40-65" segment is larger — 43% vs. 37% and 35%.
We can conclude that, in general, the client of the branch in Rostov differ in behavior from customers in Yaroslavl and Rybinsk. We will divide customers into segments by branches to look at the differences:
# we group the data by branch, count the number of customers in each and create a pie chart
fig = px.pie(data.groupby('city').agg(number=('city', 'count')).reset_index(),
values='number',
names='city',
title='Segment shares by branch',
width=500, height=500)
fig.show()
Half of the bank's customers account for Yaroslavl (the regional center of Yaroslavl region), in Rybinsk and Rostov about 25% of all customers.
Let's look at the boxplots of credit scoring, the number of objects owned, balance and salary by branch.
features = ['credit_scoring', 'age', 'balance', 'objects_owned', 'estimated_salary']
titles = ['credit scoring', 'age', 'balance', 'objects owned', 'estimated salary']
for i in range(len(features)):
plt.figure(figsize=(9,6))
plt.title('Boxplot: ' + titles[i],fontsize=15)
g = sns.boxplot(data=data,
x='city',
y=data[features[i]])
plt.xlabel('Segments by branch',fontsize=12)
plt.ylabel(titles[i].capitalize(),fontsize=12);
On the boxplots, we see that the salaries of clients in different branches practically do not differ.
Clients in Rostov have a higher median age of 38 years (in Yaroslavl and Rybinsk — 37), 25% of clients are over 45 years old (in Yaroslavl and Rybinsk over 43 and 44, respectively).
With the same median balance in all three segments, there is less scope in Rostov — 50% of customers have a balance from 103 to 138 thousand rubles.
Also, with the same median number of objects owned by customers of all branches,
Let's look at the percentage of clients in each branch according to the following features: credit card availability, loyalty, churn.
features = ['credit_card', 'loyalty', 'churn', 'age_segment', 'product_segment']
titles_legend = ['Credit card', 'Loyalty', 'Churn', 'Age segment', 'Product segment']
titles = ['Percentage of customers with and without a credit card by branch',
'Percentage of loyal and disloyal customers by branches',
'Percentage of churned and non-churned customers by branches',
'Percentage of customers by age',
'Percentage of customers by number of products']
for i in range(len(features)):
df = data.groupby(['city', features[i]]).agg(clients = ('city', 'count')).reset_index()
df['percent'] = round(df['clients'] / df.groupby('city')['clients']. transform('sum') * 100)
df[features[i]] = df[features[i]].astype('str')
display(df.head())
fig = px.bar(df,
x="percent",
y="city",
color=features[i],
orientation='h', width=1000, height=400,
color_discrete_sequence=px.colors.qualitative.T10,
text = df['percent'].map("{:,}%".format),
labels={"percent": "Percentage of customers",
"city": "City",
features[i]: titles_legend[i]},
title=titles[i])
fig.show()
| city | credit_card | clients | percent | |
|---|---|---|---|---|
| 0 | Rostov | 0 | 712 | 29.0 |
| 1 | Rostov | 1 | 1780 | 71.0 |
| 2 | Rybinsk | 0 | 752 | 30.0 |
| 3 | Rybinsk | 1 | 1717 | 70.0 |
| 4 | Yaroslavl | 0 | 1463 | 29.0 |
| city | loyalty | clients | percent | |
|---|---|---|---|---|
| 0 | Rostov | 0 | 1253 | 50.0 |
| 1 | Rostov | 1 | 1239 | 50.0 |
| 2 | Rybinsk | 0 | 1161 | 47.0 |
| 3 | Rybinsk | 1 | 1308 | 53.0 |
| 4 | Yaroslavl | 0 | 2412 | 48.0 |
| city | churn | clients | percent | |
|---|---|---|---|---|
| 0 | Rostov | 0 | 1686 | 68.0 |
| 1 | Rostov | 1 | 806 | 32.0 |
| 2 | Rybinsk | 0 | 2058 | 83.0 |
| 3 | Rybinsk | 1 | 411 | 17.0 |
| 4 | Yaroslavl | 0 | 4173 | 84.0 |
| city | age_segment | clients | percent | |
|---|---|---|---|---|
| 0 | Rostov | 18-40 | 1361 | 55.0 |
| 1 | Rostov | 40-65 | 1061 | 43.0 |
| 2 | Rostov | 65+ | 70 | 3.0 |
| 3 | Rybinsk | 18-40 | 1496 | 61.0 |
| 4 | Rybinsk | 40-65 | 903 | 37.0 |
| city | product_segment | clients | percent | |
|---|---|---|---|---|
| 0 | Rostov | 1 | 1339 | 54.0 |
| 1 | Rostov | 2 | 1033 | 41.0 |
| 2 | Rostov | 3+ | 120 | 5.0 |
| 3 | Rybinsk | 1 | 1217 | 49.0 |
| 4 | Rybinsk | 2 | 1179 | 48.0 |
Rostov differs from other branches in 'churn' feature: the share of churned customers for this branch is twice as high as for others (32%).
Such a high churn level is due to the fact that:
Such features of this particular city are most likely related to the demographic situation. According to Rosstat (Russian Federal State Statistics Service), as of 2022, the population of Rostov is 30 thousand people, and Rybinsk is 182 thousand. At the same time, the number of bank customers in these two cities is almost the same.
Clients from Rostov are older on average, because, most likely, young people leave for larger cities. At the same time, the city has the highest proportion of bank customers — 8% of residents, which can be explained by low competition among banks in the city, since it is relatively small.
We are testing two hypotheses:
In this case, we will test the hypothesis about the equality of the averages of the two general populations. We formulate the null and alternative hypotheses.
Null hypothesis: the average income values of customers who use two products of the bank and those who use one are equal.
Alternative hypothesis: the average income values of customers who use two products of the bank, and those who use one, are not equal.
product_1 = list(data.query('product_segment == "1"')['estimated_salary'])
product_2 = list(data.query('product_segment == "2"')['estimated_salary'])
alpha = .05 # significance level
# if the p-value turns out to be less than it, we reject the hypothesis
results = st.ttest_ind(
product_1,
product_2)
print('p-value:', results.pvalue)
if results.pvalue < alpha:
print("We reject the null hypothesis")
else:
print("It was not possible to reject the null hypothesis")
p-value: 0.4059540710452265 It was not possible to reject the null hypothesis
There is no difference in the average value of income for customers who use one and two products.
In this case, we will also test the hypothesis of the equality of the averages of the two general populations. We formulate the null and alternative hypotheses.
Null hypothesis: the average balance values of customers who use two bank products and those who use one are equal.
Alternative hypothesis: the average balance values of customers who use two products of the bank, and those who use one, are not equal.
product_1 = list(data.query('product_segment == "1" and balance == balance')['balance'])
product_2 = list(data.query('product_segment == "2" and balance == balance')['balance'])
alpha = .05 # significance level
# if the p-value turns out to be less than it, we reject the hypothesis
results = st.ttest_ind(
product_1,
product_2)
print('p-value:', results.pvalue)
if results.pvalue < alpha:
print("We reject the null hypothesis")
else:
print("It was not possible to reject the null hypothesis")
p-value: 0.7462515600712869 It was not possible to reject the null hypothesis
The average balance value of customers who use one product and two does not differ.
1. Study and preprocessing There are 10 000 records in the dataset. There are no duplicates. The dataset was processed as follows:
balance column were left unprocessed, because they may be related to the peculiarities of the work of the branch in Rostov.There are 9 941 records left in the dataset after processing.
2. Exploratory Data Analysis
"Churn" feature is directly correlated with age, as well as with the "three products" feature. There is a weak direct correlation with the "four products" feature and the city of Rostov. The median age of the churned clients is higher: 45 years versus 36 for the remaining clients. A significant proportion of the churned customers are aged from 40 to 65 years. After 65, their share decreases.
3. Customer segmentation
The segments practically do not differ in such features as credit scoring, balance and salary, 70% of customers in each segment have a credit card.
One product. The number of such clients is the majority — 50.8%. The median age is 38 years. Half of the clients have from 2 to 7 objects owned, 25% have 7 or more. There are the same number of loyal and disloyal customers. The share of churned customers is 28%.
Two products. Clients with two products — 45.9%. The median age is 36 years. Half of the clients have from 3 to 7 objects owned, 25% have 7 or more. The highest share of loyal customers (53%). The lowest share of churned customers (only 8%).
Three and four products. The smallest segment: only 3.28% of all the bank's customers. The median age is 43 years. Half of the clients have from 3 to 8 objects owned, 25% have 8 or more. In this segment, the lowest share of loyal customers (44%), and the highest share of the churned clients (86%).
18-40 years old. The number of such clients is the majority — 59.9%. Half of the clients have from 3 to 7 objects owned, 25% have 7 or more. Half of the customers use two products, 48% one, only 2% three and four. There are the same number of loyal and disloyal customers, the lowest proportion of churned customers (10%).
40-65 years old. This segment accounted for 37.3% of the total number of customers. Half of the clients have from 2 to 8 objects owned, 25% have 8 or more. 56% of customers use one product, 39% two, 5.5% three and four. Loyal and disloyal customers are approximately equal, the highest proportion of churned customers (37%).
65+ years old. The smallest segment — 2.82% of all the bank's customers. Clients in this segment are more reliable: the median credit score is 660 points. Half of the clients have from 3 to 8 objects owned, 25% have 8 or more. 52% of customers have one product, 45% have two, 3.6% have 3 and 4. This segment has the highest share of loyal customers — 86%, churned customers — 15%.
We also concluded that, in general, the customers of the branch in Rostov differ in behavior from customers in Yaroslavl and Rybinsk, and studied the differences.
Rostov differs from other branches in "churn" feature: the share of churned customers for in branch is twice as high as in others (32%).
Such a high churn rate is due to the fact that:
Such features of this particular city are most likely related to the demographic situation. According to Rosstat (Russian Federal State Statistics Service), as of 2022, the population of Rostov is 30 thousand people, and Rybinsk is 182 thousand. At the same time, the number of bank clients in these two cities is almost the same.
Clients from Rostov are older on average, because, most likely, young people leave for larger cities. At the same time, the city has the highest proportion of bank customers — 8% of residents, which can be explained by low competition among banks in the city, since it is relatively small.
4. Hypothesis testing
We tested two hypotheses:
Both hypotheses were not confirmed: there were no differences in the average income and average balance of customers with two and one products.
5. Recommendations
Develop a product aimed at customers aged 40-65 years: more than half of these customers use only one product (most likely, a loan) and they have the highest percentage of churn rate (they paid off the loan and left). Such a product can be debit cards with cashback, loans for the education of children with a reduced rate, mortgage loans.
Offer a new product to older customers from Rostov, for example, increased deposit rates for retired people. This will help to reduce churn, because the "65+" age segment has the highest loyalty.